Action-dependent Control Variates for Policy Optimization via Stein's Identity
Authors
Abstract
Policy gradient methods have achieved remarkable successes in solving challenging reinforcement learning problems. However, they still often suffer from high variance in the policy gradient estimates, which leads to poor sample efficiency during training. In this work, we propose a control variate method to effectively reduce the variance of policy gradient methods. Motivated by Stein's identity, our method extends the control variates previously used in REINFORCE and advantage actor-critic by introducing more general, action-dependent baseline functions. Empirical studies show that our method significantly improves the sample efficiency of state-of-the-art policy gradient approaches.
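To make the construction concrete, here is a brief sketch of the underlying identity; the notation (baseline \phi, reparameterization a = f_\theta(s, \xi)) is chosen for exposition and is not quoted from the abstract above. For a policy \pi_\theta(a \mid s) that is smooth in the action a, integration by parts gives Stein's identity

\mathbb{E}_{a \sim \pi_\theta(\cdot \mid s)}\big[\, \phi(s,a)\, \nabla_a \log \pi_\theta(a \mid s) + \nabla_a \phi(s,a) \,\big] = 0,

valid for any sufficiently smooth baseline \phi(s,a) for which \pi_\theta(a \mid s)\, \phi(s,a) vanishes on the boundary of the action space. Because this expectation is zero, such an action-dependent \phi can be subtracted from the policy gradient without introducing bias; assuming a reparameterizable policy a = f_\theta(s, \xi), the corrected estimator takes the (sketched) form

\nabla_\theta J(\theta) = \mathbb{E}\big[\, \nabla_\theta \log \pi_\theta(a \mid s)\,\big( Q^\pi(s,a) - \phi(s,a) \big) + \nabla_\theta f_\theta(s, \xi)^{\top} \nabla_a \phi(s,a) \,\big],

where the second term compensates for the action dependence of \phi. A state-only baseline \phi(s) recovers the familiar advantage actor-critic correction as a special case.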
Similar Articles
Sample-efficient Policy Optimization with Stein Control Variate
Policy gradient methods have achieved remarkable successes in solving challenging reinforcement learning problems. However, they still often suffer from high variance in the policy gradient estimates, which leads to poor sample efficiency during training. In this work, we propose a control variate method to effectively reduce the variance of policy gradient methods. Motivated by Stein's...
Policy Optimization with Second-Order Advantage
Policy optimization on high-dimensional action spaces is difficult due to the high variance of policy gradient estimators. We present the action subspace dependent gradient (ASDG) estimator, which incorporates the Rao-Blackwell theorem (RB) and control variates (CV) into a unified framework to reduce the variance. To invoke RB, the algorithm learns the underlying factorization s...
Efficient iterative policy optimization
We tackle the issue of finding a good policy when the number of policy updates is limited. This is done by approximating the expected policy reward as a sequence of concave lower bounds which can be efficiently maximized, drastically reducing the number of policy updates required to achieve good performance. We also extend existing methods to negative rewards, enabling the use of control variates.
Backpropagation through the Void: Optimizing control variates for black-box gradient estimation
Gradient-based optimization is the foundation of deep learning and reinforcement learning, but is difficult to apply when the mechanism being optimized is unknown or not differentiable. We introduce a general framework for learning low-variance, unbiased gradient estimators, applicable to black-box functions of discrete or continuous random variables. Our method uses gradients of a surrogate ne...
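As a rough sketch of the control-variate construction this line of work uses (the notation below, including the surrogate c_\phi, is illustrative rather than quoted from the abstract): for a sample b drawn from p(b \mid \theta) via a differentiable reparameterization, an estimator of the form

\hat{g} = \big( f(b) - c_\phi(b) \big)\, \nabla_\theta \log p(b \mid \theta) + \nabla_\theta c_\phi(b)

remains unbiased for \nabla_\theta \mathbb{E}_{p(b \mid \theta)}[f(b)], because the expectation of the added term \nabla_\theta c_\phi(b) exactly cancels the expectation of the subtracted score-function term -c_\phi(b)\, \nabla_\theta \log p(b \mid \theta); the surrogate's parameters \phi can then be tuned to minimize the variance of the estimator.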